AITopics | gradient norm

Physics-informed machine learning (PIML) integrates mechanistic knowledge, typically in the form of partial differential equations (PDEs), into data-driven models. Despite strong empirical performance, its statistical generalisation properties remain poorly understood, particularly in the regression setting with unbounded losses. Existing analyses rely on approximation or stability arguments and do not fully capture how physical structure influences generalisation from finite data. In this work, we develop a PAC-Bayesian framework for PIML that provides high-probability generalisation guarantees in the presence of unbounded losses. We adopt a multi-task perspective that jointly treats data fidelity, PDE residuals, and initial and boundary conditions, avoiding the looseness induced by standard union-bound approaches. Our analysis leverages the structure of physics-informed objectives to derive novel bounds where the complexity scales with input-gradient norms of the losses, revealing a direct link between physical regularity and generalisation. We instantiate this framework under Sobolev and Poincaré-type assumptions, yielding two classes of bounds that trade off statistical complexity and smoothness in different regimes. Building on these results, we propose a self-bounding-aware learning algorithm that directly optimises tractable surrogates of the derived bounds, along with a practical procedure to estimate the associated constants in realistic settings. Empirical evaluations on standard PDE benchmarks demonstrate that our bounds are non-vacuous, significantly tighter than union-bound baselines, and can be effectively minimised during training. Overall, our results provide a principled statistical foundation for the generalisation of physics-informed models.

artificial intelligence, assumption, machine learning, (17 more...)

arXiv.org Machine Learning

2605.26341

Country: Europe (1.00)

Genre: Research Report > New Finding (0.48)


ScheduleFree+: Scaling Learning-Rate-Free & Schedule-Free Learning to Large Language Models

Defazio, Aaron

arXiv.org Machine Learning, May-20-2026

Schedule-Free Learning has shown promise as a practical anytime training method for machine learning, with success across dozens of standard benchmark problems. However, strong performance for LLM training has only been demonstrated at small scales. We identify a number of fixes necessary to scale up Schedule-Free Learning to larger batch sizes and model sizes, and present a learning-rate-free and schedule-free method (ScheduleFree+) for training large language models, which greatly outperforms Warmup-Stable-Decay (WSD) schedules. We also demonstrate that Schedule-Free Learning is most effective for long-duration training: at 1000 tokens per parameter, it outperforms SOTA schedules by 31%. Schedule-Free Learning provides a theoretical foundation for the use of model averaging and checkpoint merging during pretraining.

large language model, machine learning, natural language, (19 more...)

arXiv.org Machine Learning

2605.19095

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)


Robust and Fast Training via Per-Sample Clipping

Nobile, Davide; Grohs, Philipp

arXiv.org Machine Learning, May-5-2026

We propose a robust gradient estimator based on per-sample gradient clipping and analyze its properties both theoretically and empirically. We show that the resulting method, per-sample clipped SGD (PS-Clip-SGD), achieves optimal in-expectation convergence rates for non-convex optimization problems under heavy-tailed gradient noise. Moreover, we establish high-probability convergence guarantees that match the in-expectation rates up to polylogarithmic factors in the failure probability. We complement our theoretical results with multiple numerical experiments. In particular, we demonstrate that PS-Clip-SGD outperforms both vanilla SGD with momentum and standard gradient clipping when training AlexNet on the CIFAR-100 dataset, even after accounting for the additional computational time caused by per-sample clipping. We also empirically show that, in the presence of gradient accumulation, applying clipping at the mini-batch level can improve training performance while incurring virtually no additional computational cost. This finding is particularly interesting, as it contradicts the common practice of applying clipping only after all accumulation steps have been completed.
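The contrast between per-sample clipping and standard mini-batch clipping can be illustrated with a minimal NumPy sketch. It assumes the common form of clipping in which each gradient is rescaled by min(1, c / ||g||); the abstract does not spell out the exact estimator, so the function names, the threshold `clip_threshold`, and the toy gradients below are hypothetical and are not taken from the paper.

```python
import numpy as np

def per_sample_clipped_gradient(per_sample_grads, clip_threshold):
    """Clip each sample's gradient to norm <= clip_threshold, then average."""
    grads = np.asarray(per_sample_grads, dtype=float)
    norms = np.linalg.norm(grads, axis=1)
    # Scale factor min(1, c / ||g_i||); the floor guards against division by zero.
    scales = np.minimum(1.0, clip_threshold / np.maximum(norms, 1e-12))
    return (grads * scales[:, None]).mean(axis=0)

def batch_clipped_gradient(per_sample_grads, clip_threshold):
    """Standard clipping: average first, then clip the mini-batch gradient."""
    mean_grad = np.asarray(per_sample_grads, dtype=float).mean(axis=0)
    norm = np.linalg.norm(mean_grad)
    return mean_grad * min(1.0, clip_threshold / max(norm, 1e-12))

# Toy mini-batch (assumed data): two ordinary gradients and one heavy-tailed outlier.
grads = np.array([[0.3, 0.4], [0.0, 0.5], [300.0, 400.0]])
ps = per_sample_clipped_gradient(grads, clip_threshold=1.0)
bc = batch_clipped_gradient(grads, clip_threshold=1.0)
print("per-sample clipped:", ps)
print("batch clipped:     ", bc)
```

In this toy batch the outlier dominates the direction of the batch-clipped gradient, whereas per-sample clipping caps its contribution before averaging, which is the intuition behind robustness to heavy-tailed gradient noise.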

artificial intelligence, clip-sgd, machine learning, (17 more...)

arXiv.org Machine Learning

2605.02701

Country: North America > Canada > Ontario (0.28)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)


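A Missing Proofs. Theorem 1. The excessive loss of a group $a \in \mathcal{A}$ is upper bounded by³: $R(a) \le \|g_a^\ell\| \, \|\bar\theta - \theta^*\| + \tfrac{1}{2} \lambda(H_a^\ell) \, \|\bar\theta - \theta^*\|^2 + O(\|\bar\theta - \theta^*\|^3)$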

Neural Information Processing Systems, May-1-2026, 02:47:32 GMT

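[...] where $g_a^\ell = \nabla_\theta J(\theta^*; D_a)$ is the gradient and $H_a^\ell = \nabla^2_\theta J(\theta^*; D_a)$ is the Hessian matrix of the loss function $\ell$ at the optimal parameter vector $\theta^*$, computed using the group data $D_a$ (henceforth simply referred to as the group Hessian), and $\lambda(\Sigma)$ is the maximum eigenvalue of a matrix $\Sigma$.

Proof. Using a second-order Taylor expansion around $\theta^*$, the excessive loss $R(a)$ for a group $a \in \mathcal{A}$ can be stated as:

$$R(a) = J(\bar\theta; D_a) - J(\theta^*; D_a) = \Big[ J(\theta^*; D_a) + (\bar\theta - \theta^*)^\top g_a^\ell + \tfrac{1}{2} (\bar\theta - \theta^*)^\top H_a^\ell (\bar\theta - \theta^*) + O(\|\bar\theta - \theta^*\|^3) \Big] - J(\theta^*; D_a)$$

The above follows from the loss $\ell(\cdot)$ being at least twice differentiable, by assumption. [...]

Consider two groups $a$ and $b$ in $\mathcal{A}$ with $|D_a|$ [...] $|D_b|$. [...]

Proposition 2. For a given group $a \in \mathcal{A}$, gradient norms can be upper bounded as: $\|g_a^\ell\| \in O\big(\sum [...]\big)$. The above proposition is presented in the context of cross-entropy loss or mean squared error loss functions. These two cases are reviewed as follows. [...]

³ With a slight abuse of notation, the results refer to $\bar\theta$ as the homonymous vector which is extended with $k - \bar{k}$ zeros.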
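The Taylor-expansion bound can be checked numerically in a simple case. The sketch below assumes a mean-squared-error loss for a linear model, for which the loss is exactly quadratic and the cubic remainder vanishes, so the excess loss should equal the second-order expansion and stay below the gradient-norm plus largest-eigenvalue bound. The data, the reference vector `theta_star`, and the perturbed vector `theta_bar` are all synthetic assumptions chosen for illustration, not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 4
X = rng.normal(size=(n, k))   # synthetic group features (assumed)
y = rng.normal(size=n)        # synthetic group targets (assumed)

def loss(theta):
    """Mean squared error J(theta; D_a) = (1 / 2n) * ||X theta - y||^2."""
    residual = X @ theta - y
    return 0.5 * np.mean(residual ** 2)

theta_star = rng.normal(size=k)   # stands in for the reference parameters
theta_bar = theta_star.copy()
theta_bar[2:] = 0.0               # perturbed vector with some entries set to zero

g = X.T @ (X @ theta_star - y) / n   # group gradient at theta_star
H = X.T @ X / n                      # group Hessian (constant for a quadratic loss)
d = theta_bar - theta_star

excess = loss(theta_bar) - loss(theta_star)
taylor = g @ d + 0.5 * d @ H @ d     # exact here: no cubic remainder
lam_max = np.linalg.eigvalsh(H)[-1]  # eigvalsh returns eigenvalues in ascending order
bound = np.linalg.norm(g) * np.linalg.norm(d) + 0.5 * lam_max * np.linalg.norm(d) ** 2

print(f"excess loss = {excess:.6f}, Taylor = {taylor:.6f}, bound = {bound:.6f}")
```

The two inequalities used are Cauchy-Schwarz for the gradient term and the Rayleigh-quotient bound d^T H d <= lambda_max(H) ||d||^2 for the Hessian term.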

artificial intelligence, gradient norm, machine learning, (16 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.71)
